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We present the Gaussian process density sampler (GPDS), an 
exchangeable generative model for use in nonparametric Bayesian 
density estimation. Samples drawn from the GPDS are consistent 
with exact, independent samples from a distribution defined by a 
density that is a transformation of a function drawn from a Gaussian 
process prior. Our formulation allows us to infer an unknown density 
from data using Markov chain Monte Carlo, which gives samples 
from the posterior distribution over density functions and from the 
predictive distribution on data space. We describe two such MCMC 
methods. Both methods also allow inference of the hyperparameters 
of the Gaussian process. 

1. Introduction. We propose a method for incorporating a Gaussian 
process into a prior on probability density functions. While such construc- 
tions have been proposed before [Leonard, 1978, Thorburn, 1986, Lenk, 1988, 
1991, Csato, 2002, Tokdar and Ghosh, 2007, Tokdar, 2007], ours is the first 
that allows a procedure for drawing exact and exchangeable data samples 
from a density drawn from the prior. We call this prior and the associated 
procedure the Gaussian process density sampler (GPDS). Given data, this 
generative prior allows us to perform inference of the unnormalised density. 
We present two Markov chain Monte Carlo (MCMC) algorithms for perform- 
ing this inference, one based on exchange sampling [Murray et al., 2006] and 
the other based on inferring the latent generative history. In both cases we 
are also able to infer the parameters governing the covariance kernel, and 
draw samples from the predictive distribution on data space. 

Bayesian nonparametric inference is appealing because it allows models 
to include an arbitrary number of parameters, without requiring expensive 
dimensionality-altering computations for inference. The most popular tool 
for nonparametric Bayesian modeling of an unknown probability measure is 
the Dirichlet process [Ferguson, 1973] and related constructions (e.g., Pit- 
man and Yor [1997] and Ishwaran and James [2001]). Samples from the 
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Dirichlet process, however, are discrete distributions with probability one. 
For many inference problems, we wish to model probabilities on continuous 
spaces and in such problems our prior beliefs are often best captured by a 
distribution over probability density functions. 

To fill the gap between nonparametric priors on discrete distributions 
and nonparametric priors on continuous densities, the Dirichlet process is 
frequently used to add a countably-infinite number of parameters into a con- 
tinuous model. The most popular example is the infinite mixture of para- 
metric distributions [Escobar and West, 1995], another example is kernel 
convolution [Lo, 1984]. The Dirichlet diffusion tree [Neal, 2001, 2003] and 
Polya trees [Lavine, 1992, 1994] provide more direct nonparametric priors on 
distributions and, in contrast to the Dirichlet process, can produce densities. 
All of these priors are based on beliefs of an underlying structure, either a 
clustering or tree-based hierarchy. 

Prior beliefs about a distribution over data are sometimes best expressed 
directly in terms of the probability density function — its continuity, support 
and smoothness properties, for example. There is a rich literature on incorpo- 
rating prior beliefs about functions into nonparametric Bayesian regression 
models, using splines, neural networks and stochastic processes (e.g., Di- 
Matteo et al. [2001], MacKay [1992], and O'Hagan [1978]). However, priors 
on general functions have largely resisted application to density estimation, 
due to the requirements that probability density functions be nonnegative 
and integrate to one. This work introduces the first fully-nonparametric 
Bayesian kernel method for density estimation that does not require a finite- 
dimensional approximation to perform inference. 

2. The Gaussian process density sampler prior. The GPDS pro- 
vides a probability distribution on a space X, which we call the data space. 
In many problems, X is the D-dimensional real space M. D . 

We first place a Gaussian process prior over a scalar function g{x) : X — ► R. 
This means that the prior distribution over any discrete set of function 
values, {g(x n )}n = i, is a multivariate normal distribution. These distribu- 
tions can be consistently defined with a positive definite covariance func- 
tion C(-, •) : X x X — * M and a mean function m(-) : X — > M.. The mean and 
covariance functions are parameterised by hyperparameters 9. For a more 
detailed review of Gaussian processes see, e.g., Rasmussen and Williams 
[2006]. 

We construct a map from the function g(x) to a probability density func- 
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tion 1 f(x) via 



(2.1) 



i 



$(s(a;)) ^(x | V) 



where 7t(;k | ^) is a parametric 6ase density that corresponds to an arbi- 
trary base probability measure on X, with hyperparameters ip. The func- 
tion <]?(•) : R — ► (0, 1) is a positive function with upper bound 1. We use the 
bold notation g to refer to the function g(x) compactly as a vector of (infi- 
nite) length on which it is possible to perform inference. The normalisation 
constant 2 n [g] is a functional of g{x): 



We include the subscript ir to indicate implicit dependence on the den- 
sity ir(x). Through the map defined by Equation 2.1, a Gaussian process 
provides a prior distribution over normalised probability density functions 
on X. Figure 1 shows several realisations of densities from this prior, along 
with sample data. 

Although we only require that the function <£(•) be positive and bounded, 
it is convenient for inference if it is a bijective map between M. and (0, 1). 
If $(•) is bijective then each function that maps X to (0, 1) corresponds 
to a unique realisation g{x) from the Gaussian process. Sigmoids, such as 
the cumulative normal distribution function and the logistic function, are 
bijective functions with this domain and range. We take <&(•) to be the 
logistic function, i.e., $(z) = 1/(1 + exp(— z)). 

3. Generating data from the prior. We can use rejection sampling 
to simulate samples from a common density drawn from the the prior de- 
scribed in Section 2. A rejection sampler requires a proposal density that 
upper bounds the unnormalised density of interest. In this case, the proposal 
density is n(x \ tp) and the unnormalised density of interest is $>(g(x)) tt(x \ ifi). 
We assume that it is possible to draw samples directly from ir(x \ ip). 

If g{x) were known, rejection sampling would proceed as follows: first 
generate proposals {x r } from the base density ir(x\ijj). The proposal x r 
would be accepted if a variate u r drawn uniformly from (0, 1) was less 
than $>(g(x r )). These samples would be exact in the sense that they were 
not biased by the starting state of a finite Markov chain. However, in the 

1 We will use the word "density," according to the idea that X is R . However, this 
construction would provide a distribution over probability mass functions for countable X. 
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Fig 1: Four samples from the GPDS prior are shown, with 250 data samples. 
The contour lines show the approximate unnormalized densities. In each case 
the base density is the zero-mean circular Gaussian with unit variance. The 
mean function is set to zero. The covariance function is the squared expo- 
nential: C(x, x') = a 2 exp(— \ J2d( x d — x d) 2 /^d)' with parameters varied as 
labeled in each subplot. <£(•) is the logistic function in these plots. 
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GPDS, g(x) is not known: it is a random function drawn from a Gaus- 
sian process prior. We can nevertheless use rejection sampling by "discover- 
ing" g(x) as we proceed at just the places we need to know it, by sampling 
from the prior distribution of the latent function. As the values of g{x) eval- 
uated at the {x r } are consistent with a single draw of the whole function, the 
samples are exact. This type of retrospective sampling trick has been used in 
a variety of MCMC algorithms for infinite- dimensional models Beskos et al. 
[2006], Papaspiliopoulos and Roberts [2008]. Figure 3 shows the generative 
procedure graphically. 

In practice, we generate the samples sequentially, as in Algorithm 3.1, so 
that we may be assured of having as many accepted samples as we require. 
In each loop, a proposal is drawn from the base density tt(x \ ijj) and the 
function g{x) is sampled from the Gaussian process at this proposed coor- 
dinate, conditional on all the function values already sampled. We will call 
these data the conditioning set for the function g(x) and will denote the 
conditioning inputs as X and the conditioning function values as G. After 
the function is sampled, a variate is drawn uniformly from (0, 1) and com- 
pared to the ^-squashed function at the proposal location. If the uniform 
variate falls below &(g(x)) then we accept the proposal, otherwise we re- 
ject. The proposals and their function values are added into the conditioning 
set regardless of whether that proposal was accepted or rejected. The loop 
repeats until we have as many acceptances as are required. 

The sequential procedure is infinitely exchangeable; the probability of the 
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Algorithm 3.1 Generate N exact samples from a density drawn from the prior 
Inputs: • Number of samples to draw N 

• Gaussian process covariance function C(x, x' ; 9) 

• Base density ir(x | ip) 

Outputs: • N samples V = {x n }^ 1 from a random density drawn from the prior. 

1: X <— 0, G <— > Initially the conditioning sets are empty. 

2: I? < — > Initialize the set to be returned. 

3: r <— > Count the number of proposals. 
4: repeat 

5: x r ~ tt(x I ip) > Draw a proposal. 

6: g(x r ) ~ GV(g \ x r , X, G, 8) > Sample from the GP at the proposal. 

7: u r ~ Un(0, 1) > Draw uniformly on (0, 1). 

8: if u r < <b(g(x r )) then > Rejection sampling acceptance rule. 

9: D^PUir > Store the proposal. 
10: end if 

11: X<-XU x r , G<-GU g{x r ) > Update the conditioning sets, even on rejections. 
12: + 1 

13: until ||D|| — N t> Loop until TV samples are accepted. 
14: return T> 



data is the same under reordering. First, the base density draws are i.i.d.. 
Second, conditioned on the proposals from the base density, the Gaussian 
process is a simple multivariate Gaussian distribution, which is exchange- 
able in its components. Finally, conditioned on the draw from the Gaussian 
process, the acceptance/rejection steps are independent Bernoulli samples, 
and the overall procedure is exchangeable. This property ensures that the 
sequential procedure generates data from the same distribution as the simul- 
taneous procedure described above. More broadly, exchangeable priors are 
useful in Bayesian modeling because we may consider the data conditionally 
independent, given the latent density. 

4. Inference. We now consider the problem of inference with the GPDS. 
We observe iV data T> = {x n }^ =1 that we model as having been drawn in- 
dependently from an unknown density fix). We place the GPDS prior of 
Section 2 on fix). The posterior on g is given by Bayes' theorem: 

(4 1) (a I V 6) = P{9 1 9) i^MLl nil $(g(ttn)) njxn I VP 

' J fdg'p(g'\e)(Z^g'))-^Un=iH9'MXx n \^y 

Even with Markov chain Monte Carlo, inference in this model is difficult. 
Evaluating the posterior requires computing two difficult integrals, the de- 
nominator and the normalisation constant Z n [g]. It is common for the 
marginal likelihood in the denominator of the posterior to be intractable; 
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MCMC methods such as Metropolis-Hastings are well-suited for this situ- 
ation. Posteriors such as Equation 4.1 with difficult sums in both the nu- 
merator and denominator are called doubly-intractable. Doubly-intractable 
posterior distributions appear most frequently when performing inference in 
undirected graphical models, where the partition function can be difficult to 
evaluate [M0ller et al., 2006, Murray et al., 2006]. 

To see the difficulty concretely, consider a naive Metropolis-Hastings 
Markov chain on g, with proposal density q{g <— g): 



The functions g and g are infinite-dimensional objects, which cannot be 
evaluated everywhere in practice. In other contexts, such as Gaussian pro- 
cess classification, it is possible to construct the proposal density such that 
the acceptance ratio only depends on the functions at {x n }. Here, the in- 
tractable ratio of normalising constants makes it impossible to evaluate the 
acceptance ratio without knowing the g and g everywhere. We present two 
Markov chain Monte Carlo algorithms that sidestep this difficulty. The equi- 
librium distribution in both cases is the posterior in Equation 4.1, and both 
algorithms take advantage of the exact data generation procedure described 
in Section 3. 

4.1. Exchange sampling. Exchange sampling [Murray et al., 2006, Mur- 
ray, 2007] is a variant of the Metropolis-Hastings method that enables sam- 
pling from doubly-intractable posterior distributions, subject to the require- 
ment that exact samples can be generated from the model. The procedure is 
a simpler alternative to the auxiliary variable method of M0ller et al. [2006]. 
Exchange sampling introduces additional state into the Markov chain that 
is chosen so that the intractable constants cancel out of the Metropolis- 
Hastings acceptance ratio. Murray et al. [2006] used exchange sampling to 
infer the coupling parameters of Ising models where exact data could be gen- 
erated via coupling from the past [Propp and Wilson, 1996]. In the GPDS 
we generate exact samples via the rejection method of Section 3. 

Initially, we apply exchange sampling to the posterior on g using the 
Gaussian process prior as the proposal distribution, i.e., q(g <— g) = p(g \ 6). 
The joint distribution over the data D, the current Markov state g and 
the proposal g is augmented with N "fantasy data" W = {w n }^ =1 . These 
fantasy data live on the same space X as the true data, but are drawn from 
the distribution implied by the proposal g. The augmented joint distribution 
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is 

(4.3) 

p(g, V,g,W\ 6, = p(g \ 0)p({x n }% =1 \ g, ^)p{g \ 6)p{{w n } N n=l \g^). 

Given the current state g, we jointly propose g and W by using Algo- 
rithm 3.1. This algorithm simultaneously draws g from the prior and gen- 
erates the N fantasy data W. We then propose swapping g with g. The 
acceptance ratio of the swap proposal is the ratio of the joint density in 
Equation 4.3 under each setting: 

a i = Xd&W) P{{Xn}n=l I 9, &&\$) 'P{{wn}n=l I 9, VP 
P({x n }n=l I 9, VO J2^KWn=l I 9, lf>) 

(4.4) 

$(g(g w )) ^(g(^n)) 
A = A *($(*„))$($(«;„))■ 

The normalisation constants cancel out, and the functions g{x) and g{x) 
need only be sampled from the Gaussian process at a finite number of loca- 
tions. 

Algorithm 4.1 shows the exchange sampling inference procedure for the 
GPDS. Some amount of bookkeeping is required for this procedure to be 
valid. Specifically, once something is learned about a particular function g(x), 
i.e., sampled from the Gaussian process, it cannot be forgotten until that g{x) 
is discarded. For example, when fantasy data is generated from g(x), as in 
steps 5 to 15, even if x is rejected in step 11, the (x,g(x)) pair must be 
stored in the conditioning set (step 14). If the proposed g{x) is ultimately 
rejected by step 19, only then can the conditioning set for g{x) be discarded. 
Similarly, when the current Markov state g{x) is sampled from the Gaussian 
process at the fantasy data in step 16, this information must be kept if the 
proposal is rejected (step 22). Thus step 22 expands the Markov state with 
every rejection, as information about the current g(x) accumulates. When 
the proposal g{x) is accepted, the Markov state reduces in size, as fewer 
points will typically have been sampled from g(x). An example sequence of 
rejections and an acceptance is illustrated in Figure 2. 

4.1.1. Improving the acceptance rate. For clarity, we introduced the algo- 
rithm with q{g <— g) = p{g \ 9), but this proposal is a poor choice in practice. 
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Algorithm 4.1 Simulate R steps of an exchange sampling Markov chain on p(g | T>) 



Inputs: 



Outputs: 

{g(x„ 

for r 



Number of MCMC iterations R 

Observed data V = {x n }n=i 

Gaussian process covariance function C(x, x' ; 9) 

Base density -k(x \ rp) 

R conditioning sets of function inputs and outputs {X^, G^}r=i 



)>»= 

~~ {^n} n — 1, 
f- 1 . 



QV{g\V, 
G« * 
R do 



> Initialise the function at the data. 
> Initialise conditioning sets. 

> Take R exchange sampling steps. 

> Draw a new function at the data. 

> Initialise proposal conditioning sets. 

> Initialise empty fantasy data set. 

> Run the rejection sampling loop. 
> Draw a proposal from the base density. 

> Draw the function value at the proposal. 
> Draw a uniform random variate on (0, 1). 

> Rejection sampling acceptance rule. 
> Keep the fantasy. 

t> Add proposals to the conditioning sets. 

> Loop until N fantasies are accepted. 
GV(g | W, X M , G (r) ) > Sample the current func. at the fantasies. 



{g(x n )}% =1 ~gV(g\V,0) 
X <— {x„}™ =1 , G <— {g{x n )}n=i 
W^- 
repeat 

w ~ tt(x I tp) 

g(w)~gV(g\w,±,G,e) 
Wfant ~ Un(0,l) 
if Wfant < ®(g(w)) then 

end if 

X^XU^G ^GU g(w) 
until \W\=N 

{g(w n )}n=i 



&exch 



n 



$(g(x n )) $(ff(tl>n)) 

®{g(x n )) $(g{w„)) 



> Calculate the acceptance ratio. 



> Draw a uniform random variate on (0, 1). 
> Apply the Metropolis-Hastings acceptance rule. 

> Keep the new function data. 



u mh ~ Un(0,l) 
if u mh < a exch then 

X (r+1) x j G (r+1) <- G 

else 

x (r+i) ^_ -j^(r) u | u)n j.^ =1 j> Add the fantasy evaluations to the current state 

G (r+1) ^ G (r) [J {s(u , n )}^ =1 

end if 
end for 

return {X (r) , G (r) }? =1 
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(g) New proposed g(x) (h) Fantasies from g(x) (i) Evaluate and compare 
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(j) Reject proposed g(x) 



(k) Propose, fantasise 



(1) Accept proposed g(x) 



Fig 2: A cartoon of three exchange sampling transitions. In the first two 
transitions, the proposals are rejected to demonstrate the expanding Markov 
state due to retrospective sampling of g(x). The third proposal is accepted 
and the previously-accumulated state is discarded, (a) The observed data, 
illustrated as • . (b) The initial g(x) evaluated at the data, shown as ■ . 
(c) The proposed g{x) evaluated at the data, shown as • . (d) Fantasies are 
drawn from g(x), illustrated as °. There is one rejected proposal, shown 
as x , with the corresponding function value illustrated as . (e) g{x) is 
evaluated at the fantasies and the two explanations are compared using 
Equation 4.4. (f) The proposal g{x) is rejected. The Markov state expands 
to include the fantasies, (g) A new g(x), shown in green, is evaluated at the 
data, (h) Fantasies are drawn from g{x) (two rejected fantasy proposals), 
(i) g(x) is evaluated at the fantasies and the explanations are compared, 
(j) The proposal was rejected. The Markov state expands to twelve function 
evaluations, (k) Skipping the intermediate steps, propose, fantasise, evaluate 
and compare, using the function shown in cyan. (1) The proposal is accepted 
and made the new g{x). All of the information about the old function is 
thrown away, but the new g{x) must keep information in its conditioning 
set about the fantasies it generated. 
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To achieve a better acceptance rate, it is better to make conservative, per- 
turbative proposals. This can be achieved by introducing a set of B "control 
points" in X, denoted C = {x^ G X}? =1 . These control points have as- 
sociated function values, which we denote as Q = {g{xb)}^ =l . We assume 
that C is a superset of the observed data, i.e., T> C C. The function values 
at the control points are explicitly included in the Markov state and all ret- 
rospective function draws condition on these points. New discoveries about 
the function continue to accumulate in the conditioning sets as before. The 
difference now is that the conditioning sets are initialised with the control 
points and small, perturbative proposals can be made on the function values 
at those initial points. 

To make this construction explicit, Equation 4.3 is extended to 



(4.5) p(g,g xc ,V,g,g xc ,W\C,9,^) = 

N 



GV(G\C,e) GV(g\ c \g,9)Z n [g} 



-N 



'[[ ${g(x n )) ir(x n \^) 



.n=l 

N 



x q (g <- g) gv(g\ C | g, e) z«[g]~ N II $(g( w n)) <™n I *i>), 

n=l 

where indicates the proposal of the function values at the control points, 
i.e. g = {g(xb)}b = i, and g\ c denotes the function values, excluding those 

at C. The proposal density q(g <— g) can be chosen to take smaller steps 
than the prior draws of the previous section. With the joint distribution in 
Equation 4.5, and using the conditional retrospective exchange sampling as 
before, the acceptance ratio of exchanging the pair (g,g\c) for (g,g~\c) is 

u a\ g(g <- g) gv(g\c,e) " HqM) 

[ j exch " cp g(g - g) gr(g \ c, e) £ ibw) *C5(«»)) ' 

Superficially, this might seem similar to the knot-based imputation method 
of Tokdar [2007]. However, whereas Tokdar [2007] uses knots as a finite- 
dimensional approximation, we use the control points simply to constrain 
the proposal distribution. The control points only initialise the retrospective 
sampling procedure. As we enforce a Gaussian process prior on the function 
values of the control points, the inference procedure still yields the correct 
posterior distribution on the uncompromised fully-nonparametric Gaussian 
process density sampler model. The number and locations of the control 
points are free parameters. 

A new function is proposed by first choosing values at the control points 
close to the existing function. The remainder of the function is drawn from 
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the prior, conditioned on the values at the control points. We always include 
the locations of the observed data as control points, i.e., T> <Z C. This is not 
required for the algorithm to be valid, but is convenient as all proposed func- 
tions must be evaluated at the data in any case. Taking account of the arbi- 
trary proposal density at the control points, q{{g(xk)}^ = i <— {9(, x k)}k=l)> 
the exchange sampling acceptance ratio becomes 



(4.7) 

Oexch-control 



?({$(**)KLi - M**)}f=i) p({g(x k )}ti I o) 

x * $(5(0) $(<?«» 



* *(5(*n)) ^(ffK))' 

The functions drawn from the Gaussian process must still be evaluated at 
a larger conditioning set that includes the locations of fantasies. As before, 
these can be drawn "retrospectively" as needed, but now these Gaussian 
process samples are conditioned on the values at the control points. 

4.1.2. Hyperparameter inference. One of the benefits of the Bayesian ap- 
proach is the ability to perform hierarchical inference. In this case, it allows 
us to infer the hyperparameters 9 of the Gaussian process and the hyper- 
parameters ip of the base density. We augment the exchange sampling al- 
gorithm slightly to sample from the posterior on hyperparameters: before 
proposing a new function g(x), we propose new hyperparameters 9 and ip 
from a proposal density q(9,ip <— 9, ip). When samples of the new function 
are drawn, it is done with these proposed hyperparameters. The new joint 
distribution is 

(4.8) p{g,{x n }% =l ,9^,g,{w n }%= x ,9,i>) = 

p(8, iff) p{g | 9) p({x n }% =1 | g, ip) 

x q(§, i>^8,^) p(g | 8) p( Wli I 9, VO 

where p(9, i/j) is an appropriate hyperprior. The proposal is now to ex- 
change the triplets {g,9,tp) and (g,9,ip). The acceptance of this swap has 
Metropolis-Hastings ratio 

q (0,i/> <-§,#) p(0,#) 



(4.9) ^exch-hyper 



q(9,i>^9,i/j) p{9^) 



x 



§(g(x n )) 7r(x n | $(g(w n )) Tr(w n \ ip) 



n 



=1 $(g(x n )) ir(x n | tp) $(g(w n )) ir(w n \ ip) 

This acceptance ratio generalises straightforwardly to the case with control 
points discussed in Section 4.1.1. 
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4.1.3. Sampling from the predictive distribution. An important task for 
density inference is estimation of the predictive density. The predictive dis- 
tribution arises on data space when the posterior is integrated out. For the 
GPDS, this density is 



The predictive distribution can also be thought of as the distribution on 
the next datum to arrive, given the ./V already seen and taking uncertainty 
into account. In the GPDS, the predictive density in Equation 4.10 is not 
available analytically. We nevertheless have all the tools in place to generate 
samples from the predictive distribution. We do this by using the generative 
procedure of Section 3 to generate additional data after each Metropolis- 
Hastings step. We use a very similar method to Algorithm 3.1, but initialise 
the conditioning set using the current state of the Markov chain. 

4.2. Sampling over latent histories. An alternative to inference via ex- 
change sampling is to model the latent history of the generative process. By 
using the GPDS prior to model the data, we are asserting that the data can 
be explained as the result of Algorithm 3.1. However, we did not observe any 
of the intermediate states of the rejection sampling algorithm, such as the 
number and locations of the rejected proposals, and the value of the function 
sampled from the Gaussian process prior. Nevertheless, Algorithm 3.1 pro- 
vides a well-defined probabilistic model over both the observed data and this 
latent state. By modeling this larger joint distribution we can avoid evalu- 
ating the intractable normalisation constant that would otherwise appear in 
the likelihood function. 

We model the data V = {x n }^ =1 as having been generated exactly as in 
Algorithm 3.1, i.e., run until exactly N proposals were accepted. The state 
space of the Markov chain on latent histories in the GPDS consists of: 1) the 
values of the latent function g(x) at the data, denoted Gn = {g( x n)}n=ii 
2) the number of rejections M, 3) the locations of the M rejected proposals, 
denoted M = {x m }^ =l , and 4) the values of the latent function g{x) at 
the M rejected proposals, denoted Gm = {g( x m)}m=i- The joint distribution 
over the data and the ordered history of the GPDS generative procedure, 
given the hyperparameters, is 



(4.10) 




(4.11) p(V,GN,M,GM\e,^)=GV(G N ,GM\V,M,e) 



r n i m 



x Jl $(g(x n ))ir(x n \ i/>) JJ(l-*(^(aj TO )))7r(x m |^). 
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We sample from the posterior distribution over all unknowns, which is 
proportional to the joint distribution with the observations, T>, clamped. 
The Markov chain algorithm applies three types of update in sequence: 
1) modification of the number of rejections M, 2) updating of the rejection 
locations A4, and 3) modification of the latent function values Qm and Qn- 
We will maintain an explicit ordering of the latent rejections for reasons of 
clarity, although this is not necessary due to exchangeability. At any time 
we could propose a reshuffling of the latent history, subject to it ending 
in an acceptance, and this proposal would always be accepted, as the two 
permutations have the same probability under the model. 

Inference by Markov chain Monte Carlo of the history of a probabilistic 
computational procedure has been studied previously. Beskos et al. [2006] 
sampled from the state of a rejection sampler for diffusions. Murray [2007], 
who coined the phrase "latent history," modeled data as having been the 
result of a Markov chain which had provably mixed via coupling from the 
past [Propp and Wilson, 1996]. Another example is Huber and Wolpert 
[2009], who model the history of the Matern Type III process to perform 
tractable inference. The Church programming language [Goodman et al., 
2008] also exploits this idea, by treating probabilistic procedures as first 
class objects on which inference can be performed. 

4.2.1. Modifying the number of latent rejections. We propose a new num- 
ber of latent rejections M by drawing it from a proposal density q{M <— M) . 
If M is greater than M, we must also propose new rejections to add to the 
latent state. We take advantage of the exchangeability of the process to gen- 
erate the new rejections: we imagine these proposals were made after the 
last observed datum was accepted, and our proposal is to call them rejec- 
tions and move them before the last datum. If M is less than M, we do the 
opposite by proposing to move some rejections to after the last acceptance. 

When proposing additional rejections, we must also propose times for 
them among the current latent history. There are (^Jf^ 1 ) such ways to 
insert these additional rejections into the existing latent history, such that 
the sampler terminates after the Nth acceptance. When removing rejections, 
we must choose which ones to place after the data, and there are (m-jCt) 
possible sets. Upon simplification, the proposal ratios for both addition and 
removal of rejections are identical: 

M>M M<M 

' ^— ; — * , «- , 

q{M^M)( M ^ N M l ) = g(M^M)( M M ^) = q{M ^ M)M\{M + N 
q(M^M)(^ M ) q{M^M)( M +^ 1 ) q{M <- M)M\{M + N -1)1 
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When inserting rejections, we propose the locations of the additional pro- 
posals, denoted M + , and the corresponding values of the latent function, 
denoted Gm- We generate M + by making M — M independent draws from 
the base density. We draw G^ jointly from the Gaussian process prior, con- 
ditioned on all of the current latent state, i.e., (M, Gm, 12, Gn)- The joint 
probability of this state is 



(4.12) p(V,M,M + ,G N ,GM,Gt\9,ip) 



N 



11 ir(x n \ip) $(g(x n )) 



.71=1 



M 



U n(x m \iP) (l-$(g(x m ))) 



.m=l 



M 



Y[ n(x m \ip) 

m=M+l 

xGV(GM,G N ,Gt I \V,M,M + ,< 



The joint distribution in Equation 4.12 expresses the probability of all the 
base density draws, the values of the function draws from the Gaussian pro- 
cess, and the acceptance or rejection probabilities of the proposals excluding 
the newly generated points. When we make an insertion proposal, exchange- 
ability allows us to shuffle the ordering without changing the probability; 
the only change is that now we must account for labeling the new points 
as rejections. In the acceptance ratio, all terms except for the "labeling 
probability" cancel. The reverse proposal is similar, however we denote the 
removed proposal locations as and the corresponding function values 
as Gm- The overall acceptance ratios for insertions or removals are 



(4.13) 



' q(M^M) Ml (M+N-l)\ -pr 
q(M<-M) Ml (M+N-l)\ ^9^Gm 



(1 " <%)) 



if M > M 



^hist— num 



q{ M^M) Ml (M+N-iy. „ if M <M 
, q (M<~M) Ml (M+JV-1)! ll 9£S M { ^ y > > 



A simple and convenient way of implementing this procedure is to make 
limited proposals that either insert or delete only one latent rejection at a 
time. We define a function C(M, N) : N x N + — > (0, 1] and propose inserting 
a new latent rejection with probability £. Otherwise, with with probabil- 
ity 1 — C, we propose removing a rejection. We must, of course, enforce 
C(0,N) = 1. With these limited proposals, the first case of Equation 4.13 
(proposing one new latent rejection, i.e., M = M + 1) can be written as 



(4.14) 



«hist- 



(1 _ C ( M + 1, AQ) (M + N)(1- $(g(x+))) 
C(M,N)(M+1) 
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where x + is the proposed rejection location. The location x + is drawn from 
the base density tt(x \tp). In the second case, if there is at least one latent 
rejection in the current history (M > 0), then the deletion of a single rejec- 
tion is proposed, i.e., M = M — 1. This deletion proposal has Metropolis- 
Hastings acceptance ratio 

(A^\ ((M-1,N)M 
(4.15) a hist _dei - 



(1-({M,N)) (M + N-l) (1 -<%(>-)))' 

where x~ is the location of the proposed removal. The rejection to remove 
is chosen uniformly from among the M currently in the history. 

4.2.2. Modifying latent rejection locations. Given the number of latent 
rejections M and the current latent function, we would like to sample from 
the locations of the rejections. Given the latent function, the locations of 
the rejections are independent. We make perturbative proposals of new lo- 
cations, conditionally sample the function from the Gaussian process and 
then accept or reject with Metropolis-Hastings. 

The current locations of the rejections are denoted Ai and we draw a 
proposal M from a proposal distribution q(Ai <— Ai). The values of the 
latent function at Ai are denoted Qm and we sample the function at Ai 
jointly from the Gaussian process prior given V, Qn, Ai, and Qm- The 
Metropolis-Hastings acceptance ratio of this proposal is 

Min _ Q(M <— Ai) -^4 n(x m | (1 - 3>(g(»m))) 

{ ' hist -'° CS " q(M - M) ii I VO (1 - *GK*m))) ' 

4.2.3. Modifying the latent function values. Conditioned on the number 
and location of the latent rejections, we must also sample from the latent 
function at both the data and rejection locations. The conditional joint 
posterior distribution is 



(4.i7) p{g N ,g M \M,v,e) = gv{g N ,g M \v,M 

■ N 



,n=l 



■ M 

J] (1-Hg(x m ))) 

.171=1 



This joint distribution is easily sampled using Hybrid (Hamiltonian) Monte 
Carlo [Duane et al., 1987]. For numerical reasons we suggest performing 
gradient calculations in the "whitened" space resulting from applying the 
inverse Cholesky decomposition of the covariance matrix to the function 
values. 
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Algorithm 4.2 implements the latent history algorithm in pseudocode, 
with the simple q(M <— M) that proposes increasing or decreasing the 
number of latent rejections M by one. 

4.2.4. Hyperparameter inference. Given a sample from the posterior on 
the latent history, we can also perform a Metropolis-Hastings step in the 
space of hyperparameters. As in Section 4.1.2, we have hyperparameters 9 
for the Gaussian process and ip for the base density, with joint prior den- 
sity p(0,ip). We introduce the proposal density q(6, ip <— 8, tp) to make 
proposals and ip. The acceptance ratio for a Metropolis-Hastings step in 
the posterior of the hyperparameters, given the latent history, is 



(4.18) cthist-hp 



,y? <-g, vo p(e, 4,) M({g M , Qn} \m,v,§) 

4<-0, ^ p(0, rl>) Af({g M , Qn} \M,V,0) 



' M 

n 



1 n( x r 



N 



yr TT(x n | -0) 

}} 1 TT(x n \ij)_ 



4.2.5. Generating predictive samples. As with the exchange sampling ap- 
proach in Section 4.1.3, it is possible to generate samples from the predictive 
density. As each state in the Markov chain of the latent history inference is 
a rejection sampler state, it is simply a matter of continuing the rejection 
procedure forward to produce a new sample. 

4.3. Calculating the predictive density. We have shown that each infer- 
ence method can yield predictive samples, but it is also natural to require 
that a density model provide an estimate of the normalized predictive den- 
sity itself. We use the method of Chib and Jeliazkov [2001], which consid- 
ers Metropolis-Hastings moves between a pair x and x'. Using the base 
density ir(x | ip) as the proposal density, the detailed balance condition for 
Metropolis-Hastings gives the identity 

(4.19) p{x, g, 9, ip) tt(x' \ ip) min (l, ^ </ ' X ' ' 



<%(*)) 

p(x',g,e,ip) Tr(x\ip) min (l, $ J/^m 



We integrate both sides of this identity over x' and take the expectation of 
each side under the posterior over the function g and the hyperparameters 
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Algorithm 4.2 Simulate R steps of a Markov chain on the latent history 
Inputs: • Number of MCMC iterations R 



• Observed data V = {x„}^ =1 

• Gaussian process covariance function C(x, x'; 9) 

• Base density -k(x \ tp) 

• Location proposal density q(x m *— x m ) 

• Insert proposal probability function C(M, TV) 
Outputs: • R samples of the latent history {M (r) , Q% ] , Q$ }?=i 

1: M <— 0, Qm ^ > Start out with no latent rejections. 

2: Qn ~ QV{g | T>, 0) t> Initialise the function at the data. 

3: for r <— 1 . . . R do > Take R MCMC steps on the latent history. 

4: uq ~ Un(0, 1) > Draw a uniform random variate on (0, 1). 

5: if uq < £(|A4|,iV) then > Decide whether to insert or delete. 

6: x + ~ 7r(a; | t/v) > Draw a proposed rejection location. 

7: g( x+ ) ~ GT{g I a? + , 2?, .M, (5m, <5jv, #) > Draw the proposed function value. 

8- a h - ■ ^ (1-C(|A<| + 1,JV)) (|A4|+AQ (1 - $(g(x+))) > t&nce ratio 

8. a his t-,ns<- C(\M\,N) (\M\ + 1) Acceptance ratio. 

9: tiins ~ Un(0, 1) > Draw a uniform random variate on (0, 1). 

10: if Wins < ahist-ins then t> Metropolis-Hastings acceptance rule. 

11: M <— M U x + , Qm <— Qm U > Add this new rejection. 
12: end if 

13: else if \M\ > then 

14: m ~ [Un(0, |-M|)] > Select one of the M rejections at random. 

(l-C(|A^|,JV)) (|M| + iV-l) (l-$( s (x m ))) 

16: u de i ~ Un(0, 1) > Draw a uniform random variate on (0, 1). 

17: if ttdei < fthist-dei then > Metropolis-Hastings acceptance rule. 

18: M <— M\x m , Qm <— 5jvf\s'(^m) > Remove the mth rejection. 

19: end if 

20: end if 

21: for m *— 1 . . . M do > Loop over the latent rejections. 

22: 4 m ~ q(x m <— a; m ) > Propose a new location. 

23: g{xm) ~ ST^G? | i m , 2?, Al, <5jv, 5m, 0) > Draw a function value from the CP. 

q{Xm <— X m ) 7T(*m) (1 ~ *(ff(x m ))) 

24: ahist-ioc = — ; — 7 rr^ ^7"^ ttt > Acceptance ratio. 

g(a; m <— x m ) n(x m ) (1 - <P(g(a: m ))) 

25: uioc ~ Un(0, 1) > Draw a uniform random variate from (0, 1). 

26: if tt| oc < ahist-ioc then > Metropolis-Hastings acceptance rule. 

27: x m <— ai m , g{x m ) <— g(x m ) > Update the rejection. 

28: end if 

29: end for 

30: Qn,Qm ~ HMC(C/jv, <5m | X>, Al, 0) > Update function via Hybrid Monte Carlo. 

31: Al' r ' <— Al, <— <5jv, <— 5m > Store the current version of the history. 

32: end for 

33: return {M (r \ Q%\ G$}2=i 
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9 and ip: 

j&ej&il> Jdg p(g, 9,ip\V) Jdx' p(x \ g, 6, ij,) n(x' \ i,) min (l, |^py 
Jd6jdipJdgp(g,e,iP\V) jdx' p{x'\g,Q^) tt(x\^) min 

We observe that 

p(g, d,ip\T>) p{x | g, 9, tp) = p(x, g,6,ip\V) = p(x \ V) p(g, 9,ip\x,T> 
and so we may find the predictive density via 
(4.20) 

j v) _ fdefdrl>fdgfM p(9, ij>, g, x> \ V) n(x \ ^ min (l, JggJ 
PX ~ fdejd^fdgfdx' p(e,^,g\x,V)7r(x'\^) min(l,|g^ 

Both the numerator and the denominator in Equation 4.20 are expecta- 
tions. The top is an expectation under the posterior and the bottom is an 
expectation under the posterior where the data has been augmented with x: 



(4.21) p(x\V) 



^p{g,0,ip,x'\V) 
^p(g,e,tp\V,x) 



ir(x | ip) min (l, ^(^j) 



minfl * {9{x>)) 



The numerator can be estimated directly as part of the MCMC inference. 
After each Markov step, generate a predictive sample x' and record the 
transition probabilities. The denominator requires a Markov chain to be run 
with a data set augmented by the predictive location x. At each step in 
the Markov chain, a sample x' is generated from the base density and the 
transition probabilities are evaluated. 

5. Examples. 

5.1. One-Dimensional Bounded Density. We examined the GPDS on the 
one-dimensional problem studied by Lenk [1991] and Tokdar [2007]. It is a 
mixture of an exponential and normal density on [0, 1]: 

(5.1) = L 3 «p{-3x} + i.(^) %x P |-32^-^ 2 |. 
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50 independent observations were drawn from this density and latent history 
inference with the GPDS was applied to it. Figure 3a shows a histogram of 
the observations and the true density. The base density for the GPDS was 
chosen to be the uniform distribution on (0, 1). The covariance function used 
was the stationary squared exponential: 

(5.2) CO, x') = a 2 exp 

and the parameters a and t were included in MCMC sampling for both the 
GPDS and the logistic Gaussian process. The priors used for the Gaussian 
process hyperparameters were 

(5.3) lna = 1, <x = 0.5) 

(5.4) ln^ = 0.05, a = 0.5). 

The Markov chain was simulated for 50,000 iterations, with the first 10,000 
discarded as burn-in. Figure 3b shows the predictive density from the MCMC 
run, along with several posterior samples. Figure 3c shows a histogram of 
the locations of the rejections in the latent history inference, and Figure 3d 
is a histogram of the number of rejections in samples from the latent history. 




GAUSSIAN PROCESS DENSITY MODELING 21 




-2.5 -2 -1.5 -1 -0.5 0.5 1 1.5 2 2.5 -2.5 -2 -1.5 -1 -0.5 0.5 1 1.5 2 2.5 

(a) True density and data (b) Density estimate and rejection snapshot 

Fig 4: Synthetic ring mixture example 

5.2. Two-Dimensional Unbounded Density. We generated 200 indepen- 
dent observations from a two-dimensional location mixture of Gaussians, 
where the means are drawn uniformly from a ring of radius 3/2, centred at 
the origin. The Gaussians have a variance of 1/16 so that the data density 
is 

f 2 ( Xl ,x 2 ) = r&tiNixx; 3/2cosi9,cr = l/A)M{x 2 \ 3/2sin^,cj = 1/4). 

97T 7 — 

We used the two-dimensional isotropic variant of the covariance function 
given by Equation 5.2 and used a Gaussian distribution for the base density, 
inferring the mean and covariance as in Section 4.2.4. We simulated the 
Markov chain for 50,000 iterations, discarding 10,000 as burn-in. The true 
density and the observed data are shown in Figure 4a, while the posterior 
predictive density and a posterior sample of rejection location are shown 
in Figure 4b. As expected, the rejections tend to accumulate in the center 
of the ring, where the base density places mass but the predictive density 
should be low. 

6. Discussion. 

6.1. Computational issues. Computation with a Gaussian process is ex- 
pensive. If the GP is realised on R points, the space complexity of storing the 



22 



R.P. ADAMS ET AL. 



covariance (Gram) matrix is 0{R 2 ) and the time complexity of decomposing 
(or inverting) the matrix is 0{R ? '). The time cost of this decomposition will 
be the asymptotically-dominating factor when performing GPDS inference 
using either exchange sampling or the latent history method. 

6.2. Comparing exchange sampling and latent history inference. Mod- 
eling of probability densities is fundamentally different from regression. In 
regression and classification, one conditions on having seen data in the input 
space when performing inference and prediction. In these cases, it is neces- 
sary only to model the function at places where data have been observed, 
or at predictive query locations. In density modeling, however, the places 
with low density are just as important to the model as those with high den- 
sity. Unfortunately, it is unlikely to have observed data in regions with low 
density, so a representation of the function only at locations where there 
are data is not adequate for the inference we wish to perform. One might 
think of defining a density as analogous to putting up a tent: pinning the 
canvas down with pegs (or stakes) is just as important as putting up poles. 
In exchange sampling, the "pegs" are inferred implicitly as rejections along 
the way to generating fantasy data. At each exchange sampling step, a new 
tent is constructed — complete with its own pegs — and asked to explain 
the data. In the latent history model, however, the tent is modified one piece 
at a time: pegs and poles are inserted, removed, and adjusted gradually to 
explain the data. 

It is possible also to see that the latent history model is likely to re- 
quire fewer samples from the Gaussian process as it proceeds. Consider the 
Gaussian process density sampler: when the latent history method is at equi- 
librium, its state will have some typical number of latent rejections M. This 
is about the same number of rejections as would be expected to occur dur- 
ing an exchange sampling fantasy. However, to find the acceptance ratio in 
exchange sampling it is also necessary to evaluate against the observed data 
after fantasising. This means that the Gaussian process in exchange sam- 
pling requires at least 2N + M evaluations to make a Metropolis-Hastings 
move, while the latent history method requires only N + M. This does not 
even consider the expansion of state that occurs when exchange sampling 
rejects a proposal, and additional fantasy data are incorporated into the 
Markov state. As the time complexity of computation in the Gaussian pro- 
cess grows cubically in the number of data, exchange sampling can become 
rapidly more expensive. 

Another reason that the latent history method is preferable to exchange 
sampling is that it requires less bookkeeping about the function g(x). The 
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state of the exchange sampling Markov chain is the uncountably-infinite 
object g(x). The innovation of the method is that through retrospective 
sampling we are able to make Metropolis-Hastings moves with only a finite 
number of computations. This retrospective sampling, however, means that 
information discovered about a particular g{x) must be retained for as long 
as that function is relevant to the current Markov state. In contrast, the 
state of the Markov chain when performing latent history inference only in- 
cludes g{x) at the latent rejections or thinned events. That is, rather than an 
uncountably-infinite object g(x), the Gaussian process in the latent history 
model conditions on a finite set of points in the input space. This means 
that the values of the function do not need to be kept in memory, except 
for at the data and at the locations of the rejections or thinned events. This 
contrast can also be seen in the difference between the joint distributions 
that describe the two inference methods for the Gaussian process density 
sampler. In exchange sampling, when writing Equation 4.3, we use g to de- 
note g(x) as an infinite vector. When writing the posterior distribution on 
the latent history, however, we do not need to denote an infinite function. 
Equation 4.11 only defines a distribution on the function values at the data 
and the latent rejections. 

Finally, while the latent history method enables efficient Hamiltonian 
Monte Carlo sampling of the latent function values, it is not clear how to 
combine HMC with exchange sampling. 

6.3. Restricting the function space. With both the exchange sampling 
and latent history methods, incorporating fewer latent rejections ("tent 
pegs" ) into the Gaussian process results in improved efficiency. For a given g(x), 
the expected number of rejections is N(Z 7T [g]~ 1 — 1). This expression is de- 
rived from the observation that tt(x \ ip) provides an upper bound on the 
function &(g(x)) ir(x \ if)) and the ratio of acceptances to rejections is deter- 
mined by the proportion of the mass of tt(x \ ip) contained by &{g{x)) ir{x \ ijj). 
One problem with inference is that there are many functions g{x) that can 
explain the data equivalently, as §{g{x)) tt{x \ ip) is unnormalised. Many of 
these g{x) will cause §{g{x)) to be close to zero, resulting in many rejec- 
tions. The Gaussian process prior might only provide weak regularisation to 
prevent this. 

One way to improve this situation is to require that the function g{x) 
be pinned to zero for some xq. This prevents §{g{x)) ir{x \ ip) from being 
small everywhere and reduces the redundancy in the prior that occurs due 
to normalisation. We use the base density tt(x \ i[>) as a prior on xq and 
treat it as a hyperparameter for the Gaussian process. We can then use the 
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inference methods of Sections 4.1.2 and 4.2.4 to infer an appropriate xq. 

6.4. The logistic Gaussian process. The Gaussian process is an appealing 
prior on functions due to the ability to specify the smoothness and differen- 
tiability properties of sample realizations via a covariance function, without 
choosing an explicit set of basis functions. This flexibility and intuition has 
led to interest in applying Gaussian processes to density modeling via the lo- 
gistic Gaussian process introduced by Leonard [1978] and further developed 
by Lenk [1988, 1991]. If g{x) is a random function drawn from a Gaussian 
process, then the logistic GP arrives at a density fix) on a closed interval X 
via 



for i £ I. On a bounded interval and with minor constraints on the co- 
variance function, the integral in the denominator exists and the density 
is well-defined Tokdar [2007]. The distribution on densities is closed under 
Bayesian updating. As in Equation 2.1, however, it is generally impossible 
to integrate an infinite-dimensional random function and so likelihood-based 
calculations are intractable. The GP-based prior we have presented in this 
paper allows exact inference computation despite this intractability by con- 
structing a generative model, but no such method is known for the logistic 
Gaussian process. 

In order to perform inference with the logistic Gaussian process, several 
finite-dimensional approximations have been introduced. Lenk [1991, 2003] 
proposes an approximation of the logistic GP by a truncated Karhunen- 
Loeve expansion evaluated on a grid. Tokdar [2007] uses a finite-dimensional 
approximation to the logistic Gaussian process by parameterizing the func- 
tion values on a grid and then imputing other values from the conditional 
mean. The normalisation constant is estimated via a numeric method, e.g., the 
trapezoidal rule or Simpson's rule. 

The approach of expanding the density as a finite Fourier series, as de- 
scribed by Lenk [2003], is appealing in a single dimension as one can pa- 
rameterize the function in terms of coefficients with independent Gaussian 
priors. The variances of these priors arise directly from the Gaussian pro- 
cess covariance function. As noted by Lenk [2003], however, the number of 
Fourier coefficients required grows exponentially with dimension. Finding 
higher-dimensional bases that are rich enough to express interesting struc- 
ture while also allowing efficient computation is considered an open problem. 

The imputation method of Tokdar [2007] extends to the multivariate case 
more straightforwardly. While a lattice does not scale well to many dimen- 
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sions, the imputation approximation does not necessarily require a grid. 
Tokdar [2007] proposes a method of inferring appropriate knot locations 
and explores this on a two-dimensional test problem using reversible jump 
Markov chain Monte Carlo [Green, 1995]. This has a similar motivation 
to the model presented in this paper: adapt the parameterization of the 
Gaussian process as the data demands. The GPDS achieves this via a fully- 
nonparametric generative model, Tokdar [2007] specifies a finite-dimensional 
surrogate model with the dimensionality selected as a part of inference. Ad- 
ditionally, it is unclear in Tokdar [2007] how the normalization constant is 
to be effectively estimated when the knots are irregularly arranged. It is 
suggested to perform imputation to a grid from the known knots, but this 
reintroduces some aspects of the problems of lattices in high dimensions. In 
contrast, the GPDS inference discussed in the present paper explicitly avoids 
these problems by performing computation without evaluating Z n [g]. 
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